SUMMARY


The finite strip method is a semi-analytical finite element process which allows for a discrete 
analysis of certain types of physical problems by discretizing the domain of the problem into finite 
strips. This method decomposes a single large problem into m smaller independent subproblems 
when m harmonic functions are employed, thus yielding natural parallelism at a very high level. 
In this paper we address vectorization and parallelization strategies for the dynamic analysis of 
simply-supported Mindlin plate bending problems and show how to prevent potential conflicts in 
memory access during the assemblage process. The vector and parallel implementations of this 
method and the performance results of a test problem under scalar, vector, and vector-concurrent 
execution modes on the Alliant FX/80 are also presented. 


INTRODUCTION 


More and more parallel computers have been developed and made available to the engineering
and scientific computing community in recent years. To take advantage of current and future
advanced multiprocessors, however, a great deal of effort is still needed in the search for
efficient parallel implementations. In this paper we address both the coarse-grain and fine-grain
parallelism offered by the finite strip method (FSM) for the dynamic analysis of Mindlin plate
bending problems and present our vector and parallel implementations on multiprocessors with
vector processing capabilities. FSM, first developed in the context of thin plate bending analysis,
is a semi-analytical finite element process [6, 22]. This method allows for a discrete analysis of
certain types of physical problems by discretizing their domains into finite strips, involving an
approximation of the true solution using a continuous harmonic series in one direction and piecewise
interpolation polynomials in the others. Because of the orthogonality properties of the harmonic
functions in the stiffness and mass matrix formulation, FSM decomposes a problem, when
applicable, into many smaller and independent subproblems, which yields coarse-grain parallelism
in an extremely easy and natural way.

Figure 1: The coordinate system and sign convention.

Although not as versatile as the finite element method, FSM has been applied to a wide range
of plate, folded plate, shell, and bridge deck problems [4, 6, 7, 8, 10, 18] because of its efficiency
and simplicity. The performance gained from the coarse-grain parallelism of this method in a
multiprocessing environment has been shown in [9] for the static analysis of Mindlin plate problems
and in [20] for groundwater modeling. In this paper, we report and compare the performance
results of our implementation for the dynamic analysis of a simply-supported rectangular Mindlin
plate using scalar, vector, and vector-concurrent execution modes on an Alliant FX/80.

THE PROBLEM

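In this section we briefly describe the mathematical modeling of Mindlin plate problems [17].
Let $\Omega$ be the space domain in $\mathbb{R}^2$, $\Gamma$ the boundary, and $T$ the time
domain. Let also the stress resultants, generalized strains, displacements, dynamic surface
loadings, and inertia forces be denoted respectively by $\mathbf{s}$, $\mathbf{r}$, $\mathbf{d}$,
$\mathbf{p}$, and $\mathbf{q}$:

\[
\mathbf{s} = \begin{bmatrix} M_x \\ M_y \\ M_{xy} \\ Q_x \\ Q_y \end{bmatrix}, \quad
\mathbf{r} = \begin{bmatrix} \chi_x \\ \chi_y \\ \chi_{xy} \\ \gamma_{xz} \\ \gamma_{yz} \end{bmatrix}, \quad
\mathbf{d} = \begin{bmatrix} w \\ \theta_x \\ \theta_y \end{bmatrix}, \quad
\mathbf{p} = \begin{bmatrix} q \\ m_x \\ m_y \end{bmatrix}, \quad \text{and} \quad
\mathbf{q} = \begin{bmatrix} -\rho h \ddot{w} \\ -\frac{1}{12}\rho h^3 \ddot{\theta}_x \\ -\frac{1}{12}\rho h^3 \ddot{\theta}_y \end{bmatrix}
\]

where $\rho$ stands for the mass density (per unit volume), $h$ the thickness of the plate, and
$\ddot{v}$ ($v = w$, $\theta_x$, or $\theta_y$) the second derivative of $v$ with respect to time $t$:
$\ddot{v} = \partial^2 v / \partial t^2$. The subscripts $x$, $y$, and $z$ above represent the
directions in the Cartesian coordinate system. The sign convention for the displacements and
external loadings is shown in Figure 1. Neglecting the damping effect of the plate, the
differential equations which govern the state of stress resultants, generalized strains, and
displacements in an elastic plate can be expressed as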
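1. Equilibrium equations: $\mathbf{L}_1^T \mathbf{s} + \mathbf{p} + \mathbf{q} = \mathbf{0}$ in $\Omega \otimes T$, subject to some appropriate
boundary conditions on $\Gamma$,

2. Stress-strain equations: $\mathbf{s} = \mathbf{D}\mathbf{r}$, and

3. Strain-displacement equations: $\mathbf{r} = \mathbf{L}_2 \mathbf{d}$.

Here $\mathbf{D}$ is the material property matrix of an elastic plate. $\mathbf{L}_1$ and $\mathbf{L}_2$ are the differential operators:

\[
\mathbf{L}_1^T = \begin{bmatrix}
0 & 0 & 0 & \partial/\partial x & \partial/\partial y \\
\partial/\partial x & 0 & \partial/\partial y & -1 & 0 \\
0 & \partial/\partial y & \partial/\partial x & 0 & -1
\end{bmatrix} \tag{1}
\]

and

\[
\mathbf{L}_2^T = \begin{bmatrix}
0 & 0 & 0 & \partial/\partial x & \partial/\partial y \\
-\partial/\partial x & 0 & -\partial/\partial y & -1 & 0 \\
0 & -\partial/\partial y & -\partial/\partial x & 0 & -1
\end{bmatrix} \tag{2}
\]

where the superscript $T$ denotes the transpose of a matrix.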
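For orthotropic material, the matrix $\mathbf{D}$ takes the form

\[
\mathbf{D} = \begin{bmatrix}
D_x & D_1 & & & \\
D_1 & D_y & & & \\
 & & D_{xy} & & \\
 & & & \alpha G_x & \\
 & & & & \alpha G_y
\end{bmatrix} \tag{3}
\]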

where $D_x$, $D_1$, ..., $G_y$ are the standard flexural and shear rigidities of plates and $\alpha$ is a
modification coefficient to account for the deviation of the shear strain distribution from
uniformity [4] ($\alpha = 5/6$ for a rectangular cross section; see [21, p. 371]). The rest of the
entries in $\mathbf{D}$ are zero. If the material is isotropic, then the nonzero entries take the
following values:

\[
D_x = D_y = \frac{E h^3}{12(1 - \nu^2)}, \quad
D_1 = \nu D_x, \quad
D_{xy} = \frac{1 - \nu}{2} D_x, \quad \text{and} \quad
G_x = G_y = \frac{E h}{2(1 + \nu)}
\]

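where $E$, $h$, and $\nu$ represent Young's modulus, the plate thickness, and Poisson's ratio,
respectively. The total potential energy of the plate due to the dynamic surface loading
$\mathbf{p}$ [14, 16, 17] can be written as

\[
\Pi = \int_0^t \left( \frac{1}{2} \int_\Omega (\mathbf{L}_2 \mathbf{d})^T \mathbf{D} (\mathbf{L}_2 \mathbf{d}) \, d\Omega
 - \int_\Omega \mathbf{p}^T \mathbf{d} \, d\Omega
 - \frac{1}{2} \int_\Omega \dot{\mathbf{d}}^T \mathbf{A} \dot{\mathbf{d}} \, d\Omega \right) dt \tag{4}
\]

where $\dot{\mathbf{d}} = \partial \mathbf{d} / \partial t$ and $\mathbf{A} = \mathrm{diag}\left[\rho h, \tfrac{1}{12}\rho h^3, \tfrac{1}{12}\rho h^3\right]$, a diagonal matrix.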
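
As a small numerical illustration of the isotropic rigidities and of the layout of $\mathbf{D}$ in (3), the following Python sketch evaluates the nonzero entries and arranges them into the 5 x 5 matrix. The function names and the dictionary keys are hypothetical, chosen only for this illustration, and the input values are the numbers quoted for the test plate later in the paper, used here without regard to units.

```python
def isotropic_rigidities(E, h, nu, alpha=5.0 / 6.0):
    """Nonzero entries of D for an isotropic Mindlin plate (illustrative helper)."""
    Dx = E * h**3 / (12.0 * (1.0 - nu**2))
    D1 = nu * Dx
    Dxy = 0.5 * (1.0 - nu) * Dx
    G = E * h / (2.0 * (1.0 + nu))
    return {"Dx": Dx, "Dy": Dx, "D1": D1, "Dxy": Dxy,
            "aGx": alpha * G, "aGy": alpha * G}


def material_matrix(r):
    """Arrange the rigidities into the 5x5 matrix D of equation (3)."""
    return [
        [r["Dx"], r["D1"], 0.0, 0.0, 0.0],
        [r["D1"], r["Dy"], 0.0, 0.0, 0.0],
        [0.0, 0.0, r["Dxy"], 0.0, 0.0],
        [0.0, 0.0, 0.0, r["aGx"], 0.0],
        [0.0, 0.0, 0.0, 0.0, r["aGy"]],
    ]


# Numbers quoted for the test plate (E = 30e6, h = 1, nu = 0.25).
rig = isotropic_rigidities(E=30.0e6, h=1.0, nu=0.25)
D = material_matrix(rig)
```

The bending block (rows 1 to 3) and the shear block (rows 4 and 5) are uncoupled, which is why only five distinct rigidities need to be stored.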


A STRIP ELEMENT FOR MINDLIN PLATES 


We now outline the FSM formulation for Mindlin plates using linear elements [4, 19].
We shall confine our discussion to rectangular Mindlin plate problems simply supported on two
opposite sides. Figure 2 shows a rectangular plate discretized into $n - 1$ finite strips. The plate is
assumed to be simply supported on edges $y = 0$ and $y = L_y$. Shown in Figure 3 is the mid-plane
of a typical linear strip plate element of constant thickness $h$, whose local coordinate system is
denoted by $(x', y', z')$ where $x' = x - x_i$, $y' = y$, and $z' = z$. Let $\Omega_{(e)}$ be the domain of the $e$th
strip element and $i$ and $j$ be the two longitudinal edges (nodal lines) of the element, as shown in
Figure 3. Let $\mathbf{d}_{(e)}(x,y,t)$ and $\mathbf{u}^l_{(e)}(t)$ be defined as

\[
\mathbf{d}_{(e)}(x,y,t) = \left[ w(x,y,t) \;\; \theta_x(x,y,t) \;\; \theta_y(x,y,t) \right]^T, \quad (x,y) \in \Omega_{(e)}
\]

and

\[
\mathbf{u}^l_{(e)}(t) = \left[ w^l_i(t) \;\; \theta^l_{x_i}(t) \;\; \theta^l_{y_i}(t) \;\Big|\; w^l_j(t) \;\; \theta^l_{x_j}(t) \;\; \theta^l_{y_j}(t) \right]^T
\]


where $w^l_i(t)$ denotes the $l$th harmonic coefficient (amplitude) of $w_i(y,t)$, which is the displacement
along edge $i$, etc. For a linear strip element with $m$ harmonic terms specified, the approximation
to $\mathbf{d}_{(e)}$ is given [4, 18] by

\[
\mathbf{d}_{(e)}(x,y,t) \approx \sum_{l=1}^{m} \mathbf{F}^l(x,y) \, \mathbf{u}^l_{(e)}(t) \tag{5}
\]

with

\[
\mathbf{F}^l = \begin{bmatrix}
N_i S_l & 0 & 0 & N_j S_l & 0 & 0 \\
0 & N_i S_l & 0 & 0 & N_j S_l & 0 \\
0 & 0 & N_i C_l & 0 & 0 & N_j C_l
\end{bmatrix}
\]

where $S_l$ and $C_l$ are the $l$th harmonic functions of $y$, and $N_i$ and $N_j$ are the linear shape functions
of $x$, defined by

\[
S_l = \sin\frac{l \pi y}{L_y}, \quad
C_l = \cos\frac{l \pi y}{L_y}, \quad
N_i = \frac{1 - r_{(e)}}{2}, \quad \text{and} \quad
N_j = \frac{1 + r_{(e)}}{2}
\]

where $r_{(e)}$, ranging from $-1$ to $1$, is the natural coordinate in the $x$-direction of the $e$th element.
Note that $r_{(e)} = -1 + 2x'/(x_j - x_i)$ for the element shown in Figure 3. It should be observed that the
approximation to the displacement vector in (5) satisfies the simply supported boundary conditions
on edges $y = 0$ and $y = L_y$; i.e., $w$, $\theta_x$, $\partial w/\partial x$, $\partial \theta_x/\partial x$, and $\partial \theta_y/\partial y$ all vanish on these two edges.

Figure 3: A typical plate strip element.
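
The approximation (5) can be evaluated directly once the harmonic amplitudes are known. The Python sketch below builds $\mathbf{F}^l$ at a point and sums the harmonic contributions; it is meant only to make the structure of (5) concrete and to show numerically that $w$ and $\theta_x$ vanish on the supported edges. The function names and the sample amplitudes are hypothetical.

```python
import math


def strip_shape_matrix(l, r, y, Ly):
    """3x6 matrix F^l at natural coordinate r (-1..1) and longitudinal position y."""
    Ni = 0.5 * (1.0 - r)
    Nj = 0.5 * (1.0 + r)
    arg = l * math.pi * y / Ly
    S, C = math.sin(arg), math.cos(arg)
    return [
        [Ni * S, 0.0, 0.0, Nj * S, 0.0, 0.0],
        [0.0, Ni * S, 0.0, 0.0, Nj * S, 0.0],
        [0.0, 0.0, Ni * C, 0.0, 0.0, Nj * C],
    ]


def strip_displacement(amplitudes, r, y, Ly):
    """Evaluate [w, theta_x, theta_y]; amplitudes[l-1] is the 6-vector u^l_(e)."""
    d = [0.0, 0.0, 0.0]
    for l, u in enumerate(amplitudes, start=1):
        F = strip_shape_matrix(l, r, y, Ly)
        for row in range(3):
            d[row] += sum(F[row][col] * u[col] for col in range(6))
    return d


# Two harmonics with arbitrary (made-up) amplitudes on one strip.
amps = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5, -1.0, 2.0, 0.0, 1.0, -2.0]]
edge0 = strip_displacement(amps, r=0.3, y=0.0, Ly=40.0)
edge1 = strip_displacement(amps, r=0.3, y=40.0, Ly=40.0)
mid = strip_displacement(amps, r=-1.0, y=20.0, Ly=40.0)
```

At `r = -1` only the amplitudes of nodal line $i$ contribute, so `mid[0]` reduces to the first-harmonic amplitude $w^1_i$ (the second harmonic vanishes at mid-span).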
The dynamic surface loading on the $e$th element, $\mathbf{p}_{(e)}(x,y,t)$, can often be approximated by the
sum of a harmonic series in the longitudinal direction as shown below

\[
\mathbf{p}_{(e)}(x,y,t) \approx \sum_{l=1}^{m} \mathbf{H}^l(y) \, \mathbf{p}^l_{(e)}(x,t) \tag{6}
\]

where $\mathbf{H}^l = \mathrm{diag}\left[S_l, S_l, C_l\right]$ and $\mathbf{p}^l_{(e)} = \left[ q^l \;\; m^l_x \;\; m^l_y \right]^T_{(e)}$. The subscript $(e)$ outside the brackets
indicates that every component of the vector is associated only with the $e$th element.

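Following the standard finite element procedure and taking advantage of the orthogonality
properties of the harmonic functions, we obtain a linear algebraic differential system of block
diagonal form [5] depicted by:

\[
\mathbf{M}\ddot{\mathbf{u}} + \mathbf{K}\mathbf{u} = \mathbf{f} \tag{7}
\]

where

\[
\mathbf{M} = \mathbf{M}^{11} \oplus \mathbf{M}^{22} \oplus \cdots \oplus \mathbf{M}^{mm} \quad \text{and} \quad
\mathbf{K} = \mathbf{K}^{11} \oplus \mathbf{K}^{22} \oplus \cdots \oplus \mathbf{K}^{mm}
\]

are block diagonal matrices of the same block structure. The vectors $\mathbf{u}$ and $\mathbf{f}$ are accordingly
partitioned,

\[
\mathbf{u}^T = \left[ (\mathbf{u}^1)^T \; (\mathbf{u}^2)^T \; \cdots \; (\mathbf{u}^m)^T \right] \quad \text{and} \quad
\mathbf{f}^T = \left[ (\mathbf{f}^1)^T \; (\mathbf{f}^2)^T \; \cdots \; (\mathbf{f}^m)^T \right].
\]

In (7), the symbol $\oplus$ stands for the direct sum of square matrices. $\mathbf{M}^{ll}$, $\mathbf{K}^{ll}$, $\mathbf{u}^l$, and $\mathbf{f}^l$ are the
system mass matrix, system stiffness matrix, system displacement amplitude vector, and system
load amplitude vector due to the $l$th harmonic mode, respectively. In the rest of the paper, we
shall drop the term amplitude and simply call $\mathbf{u}^l$ ($\mathbf{f}^l$) the $l$th system displacement (load) vector
for brevity. $\mathbf{M}^{ll}$ is assembled from the strip mass matrix $\mathbf{M}^{ll}_{(e)}$, $\mathbf{K}^{ll}$ from the strip stiffness matrix
$\mathbf{K}^{ll}_{(e)}$, and $\mathbf{f}^l$ from the strip load vector $\mathbf{f}^l_{(e)}$, where

\[
\mathbf{M}^{ll}_{(e)} = \int_{\Omega_{(e)}} (\mathbf{F}^l)^T \mathbf{A} \mathbf{F}^l \, d\Omega_{(e)}, \quad l = 1, \ldots, m, \tag{8}
\]

\[
\mathbf{K}^{ll}_{(e)} = \int_{\Omega_{(e)}} (\mathbf{L}_2 \mathbf{F}^l)^T \mathbf{D} (\mathbf{L}_2 \mathbf{F}^l) \, d\Omega_{(e)}, \quad l = 1, \ldots, m, \tag{9}
\]

\[
\mathbf{f}^l_{(e)} = \int_{\Omega_{(e)}} (\mathbf{F}^l)^T \mathbf{H}^l \mathbf{p}^l_{(e)} \, d\Omega_{(e)}, \quad l = 1, \ldots, m. \tag{10}
\]

For a plate discretized with $n$ nodal lines, $\mathbf{K}^{ll}$ and $\mathbf{M}^{ll}$ are square matrices of order $3n$ for each $l$.
($\mathbf{K}^{ll}_{(e)}$ and $\mathbf{M}^{ll}_{(e)}$ are of order 6.) Once the entire system stiffness matrix $\mathbf{K}$, system mass matrix
$\mathbf{M}$, and system load vector $\mathbf{f}$ are assembled and the boundary conditions imposed, the remaining
major work is to solve the linear algebraic differential system (7) for $\mathbf{u}$, $\dot{\mathbf{u}}$, and $\ddot{\mathbf{u}}$.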
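
The block diagonal form of (7) rests on the orthogonality of the harmonic functions over $0 \le y \le L_y$: products of two different harmonics integrate to zero, so no coupling block between harmonics $l$ and $k \ne l$ survives. The following Python sketch checks this numerically with a simple midpoint rule. It is an illustration only; the function name, the quadrature, and the number of sample points are assumptions made here.

```python
import math


def harmonic_inner_product(l, k, Ly, kind="sin", npts=2000):
    """Midpoint-rule value of the integral of S_l*S_k (or C_l*C_k) over [0, Ly]."""
    f = math.sin if kind == "sin" else math.cos
    dy = Ly / npts
    total = 0.0
    for q in range(npts):
        y = (q + 0.5) * dy
        total += f(l * math.pi * y / Ly) * f(k * math.pi * y / Ly)
    return total * dy


Ly = 40.0
# Gram matrix of the first four sine harmonics: expected to be (Ly/2) * identity.
gram = [[harmonic_inner_product(l, k, Ly) for k in range(1, 5)]
        for l in range(1, 5)]
```

The diagonal entries come out as $L_y/2$ and the off-diagonal entries vanish to rounding error, which is the property that lets each harmonic be generated, assembled, and solved on its own.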


PARALLEL AND VECTOR IMPLEMENTATIONS 

Computational Procedure. Similar to the finite element method, FSM normally consists of
the following three main computational components: (1) the generation of strip stiffness/mass
matrices and strip load vectors for all strip elements, (2) the assemblage of the entire system
stiffness/mass matrix and system load vector, and (3) the solution process of the resulting linear
differential system $\mathbf{M}\ddot{\mathbf{u}} + \mathbf{K}\mathbf{u} = \mathbf{f}$. There are many step-by-step integration methods available
for solving 2nd-order linear differential equations. Among them are the central difference,
Houbolt, Wilson $\theta$, and Newmark $\beta$ methods. The central difference method is an explicit scheme
and the other three are implicit. Regardless of whether the method employed is implicit or explicit,
the procedure basically involves an initial calculation of an effective coefficient matrix and then,
at each time step, the formation of an effective load vector followed by the solution of an
effective linear system. In this paper, we employ the Newmark integration method whose procedure
is shown below, where $a_0$, $a_1$, ..., $a_7$ are the Newmark integration constants [3, p. 311]:

(1) initial calculation of the effective stiffness matrix $\hat{\mathbf{K}} = \mathbf{K} + a_0 \mathbf{M}$, the factorization
of $\hat{\mathbf{K}}$ into $\mathbf{L}\mathbf{L}^T$ or $\mathbf{L}\mathbf{D}\mathbf{L}^T$ form, and then for each time step $t_{k+1}$, $k = 0, 1, \ldots$,

(2) forming the effective load vector $\hat{\mathbf{f}}$ at time $t_{k+1}$: $\hat{\mathbf{f}}_{k+1} = \mathbf{f}_{k+1} + \mathbf{M}(a_0 \mathbf{u}_k + a_2 \dot{\mathbf{u}}_k + a_3 \ddot{\mathbf{u}}_k)$,

(3) solving the effective linear system at time $t_{k+1}$: $\hat{\mathbf{K}} \mathbf{u}_{k+1} = \hat{\mathbf{f}}_{k+1}$,

(4) calculating the acceleration and velocity vectors $\ddot{\mathbf{u}}_{k+1}$ and $\dot{\mathbf{u}}_{k+1}$:

$\ddot{\mathbf{u}}_{k+1} = a_0 (\mathbf{u}_{k+1} - \mathbf{u}_k) - a_2 \dot{\mathbf{u}}_k - a_3 \ddot{\mathbf{u}}_k$, $\quad \dot{\mathbf{u}}_{k+1} = \dot{\mathbf{u}}_k + a_6 \ddot{\mathbf{u}}_k + a_7 \ddot{\mathbf{u}}_{k+1}$.

Note that the first step need be performed only once. The last three steps, however, must be
performed at every time step and therefore constitute the most time-consuming part of the entire
analysis.
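
The four Newmark steps above can be sketched for a single undamped degree of freedom, where the "matrices" are scalars and the factorization in step (1) reduces to a division. This Python sketch assumes the usual definitions of the constants for parameters $\alpha$ and $\delta$ ($a_0 = 1/(\alpha \Delta t^2)$, $a_2 = 1/(\alpha \Delta t)$, $a_3 = 1/(2\alpha) - 1$, $a_6 = \Delta t (1 - \delta)$, $a_7 = \delta \Delta t$); the function names and the free-vibration test case are hypothetical and are not taken from the paper's program.

```python
def newmark_constants(dt, alpha=0.25, delta=0.5):
    """Newmark constants needed when damping is neglected."""
    a0 = 1.0 / (alpha * dt * dt)
    a2 = 1.0 / (alpha * dt)
    a3 = 1.0 / (2.0 * alpha) - 1.0
    a6 = dt * (1.0 - delta)
    a7 = delta * dt
    return a0, a2, a3, a6, a7


def newmark_sdof(m, k, load, u0, v0, dt, nsteps, alpha=0.25, delta=0.5):
    """Integrate m*u'' + k*u = load(t); returns a list of (t, u, v, a)."""
    a0, a2, a3, a6, a7 = newmark_constants(dt, alpha, delta)
    k_eff = k + a0 * m                      # step (1): effective stiffness
    u, v = u0, v0
    a = (load(0.0) - k * u0) / m            # initial acceleration from equilibrium
    history = [(0.0, u, v, a)]
    for step in range(1, nsteps + 1):
        t = step * dt
        f_eff = load(t) + m * (a0 * u + a2 * v + a3 * a)   # step (2)
        u_new = f_eff / k_eff                              # step (3)
        a_new = a0 * (u_new - u) - a2 * v - a3 * a         # step (4)
        v_new = v + a6 * a + a7 * a_new
        u, v, a = u_new, v_new, a_new
        history.append((t, u, v, a))
    return history


# Free vibration of a unit oscillator released from u = 1 at rest.
hist = newmark_sdof(m=1.0, k=1.0, load=lambda t: 0.0,
                    u0=1.0, v0=0.0, dt=0.01, nsteps=1000)
t_end, u_end, v_end, a_end = hist[-1]
```

With $\alpha = 0.25$ and $\delta = 0.5$ (the values used in the numerical experiments) the scheme is the constant-average-acceleration rule, so the total energy of this undamped oscillator stays at its initial value up to rounding.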
To address the parallel implementation of FSM, we should first exploit the decoupled structure
of the system matrices depicted by (7), which is due to the orthogonality properties of the harmonic
functions. This decoupling leads to $m$ independent sets of differential equations. Therefore, solving
(7) is equivalent to solving

\[
\mathbf{M}^{ll} \ddot{\mathbf{u}}^l + \mathbf{K}^{ll} \mathbf{u}^l = \mathbf{f}^l, \quad l = 1, \ldots, m
\]

where $\mathbf{K}^{ll}$ and $\mathbf{M}^{ll}$, $l = 1, \ldots, m$, are block tridiagonal matrices with each block of order only
$3 \times 3$ for the ordering shown in Figure 2. Furthermore, each $\mathbf{M}^{ll}$ consists of only three nonzero
diagonals. Since there is no data dependency among these $m$ subsystems, not only can the
generation of $\mathbf{M}^{ll}_{(e)}$, $\mathbf{K}^{ll}_{(e)}$, and $\mathbf{f}^l_{(e)}$ and the assemblage of $\mathbf{M}^{ll}$, $\mathbf{K}^{ll}$, and $\mathbf{f}^l$ for each harmonic term be
performed independently, but all the subsystems can be solved in parallel. In a parallel computing
environment with parallelism of two levels (considering vectorization as the first level), this special
feature leads FSM to a fully parallelizable approach when the number of harmonic terms matches
the number of processors. The following pseudo-Fortran code outlines its computational procedure
and indicates where parallelism can be exploited for vector/concurrent executions.


C -- Initial calculations
      DO 200 l = 1, m                                              (concurrent, one CPU per iteration)
         DO 100 e = 1, N_s                                         (to be discussed)
            Generate $K^{ll}_{(e)}$, $M^{ll}_{(e)}$, and $f^l_{(e)}$
            Assemble $K^{ll}$, $M^{ll}$, and $f^l$
         END 100
         Initialize $u^l$, $\dot{u}^l$, and $\ddot{u}^l$                      (vector)
         Form $\hat{K}^{ll}$ from $K^{ll}$ and $M^{ll}$                       (vector)
         Factorize $\hat{K}^{ll}$ into $LL^T$ or $LDL^T$ form                 (vector)
      END 200
C -- Calculations for each time step
      DO until the last time step                                  (sequential)
         DO 400 l = 1, m                                           (concurrent, one CPU per iteration)
            DO 300 e = 1, N_s                                      (to be discussed)
               Generate $f^l_{(e)}$ and assemble $f^l$
            END 300
            Form effective load vector $\hat{f}^l$                  (vector)
            Solve $\hat{K}^{ll} u^l = \hat{f}^l$ for $u^l$                    (vector)
            Calculate $\ddot{u}^l$ and $\dot{u}^l$                          (vector)
         END 400
         DO 600 l = 1, m                                           (sequential)
            Accumulate displacements w for all strips              (vector-concurrent)
         END 600
      END DO

In the above pseudo-code, we omit the step of imposing boundary conditions because it
can be performed in the generation step. The word concurrent inside the parentheses after a DO
statement shows that all iterations of that loop may be performed in parallel, on the basis
of one processor per iteration; and the word vector (or vector-concurrent) indicates that the
computations involved in the statement should be performed in vector (or vector-concurrent) mode
whenever possible and desirable. Whether a vector operation is desirable depends on the startup
overhead and the vector length of the operation.

Data Structure and Parallelization. To allow current code restructurers to automatically
vectorize or parallelize certain computations, the Fortran statements related to that part of the
computations are usually written in the form of DO loops or array constructs. Potential memory
access conflicts must also be resolved. Therefore, the data structure of the code plays an essential
role. In our implementations, the system stiffness matrix K and system mass matrix M are represented
by two 3D arrays SK(1:nbk,1:n,1:m) and SM(1:nbm,1:n,1:m), respectively, where nbk (nbm) is the
semi-bandwidth of K (M), n the number of equations in each harmonic term, and m the number
of harmonic terms. It should be noted that in many situations it is more beneficial to interchange
the first two dimensions of both SK and SM, or to concatenate the first two dimensions into a single
dimension. The system load vector $\mathbf{f}$ is represented by a 2D array SF(1:n,1:m) and the vectors $\mathbf{u}$,
$\dot{\mathbf{u}}$, and $\ddot{\mathbf{u}}$ are similarly represented by 2D arrays SU, SV, and SA, respectively. This representation
allows parallelization across harmonic terms to be performed in the outermost loop. It also makes
the passing of references to subroutines an easy task.

To serve as an example, we consider the DO 200 loop where the computations inside the loop
are now translated into subroutines as shown below (the DO 400 loop follows the same approach).

CVD$L CNCALL                                  ! an Alliant directive
      DO 200 L = 1, m                         ! concurrent, one CPU per iteration
         CALL GenAss (SK(1,1,L), SM(1,1,L), SF(1,L), L, n, nbk, nbm, ns, ...)
         CALL Initialize (SU(1,L), SV(1,L), SA(1,L), ...)       ! Initialize u_0, udot_0, and uddot_0.
         CALL Form (SK(1,1,L), SM(1,1,L), n, nbk, nbm, a0)      ! Form Khat^{ll} and overwrite SK.
         CALL Factorize (SK(1,1,L), n, nbk)                     ! Factorize Khat^{ll} and overwrite SK.
      END 200

where GenAss is a subroutine performing the task of the DO 100 loop in the previous pseudo-code.
The other three subroutines are self-explanatory. In the above code, the argument ns denotes the
number of strips $N_s$ and a0 is the Newmark constant $a_0$. Using this approach, each processor will
have an identical local copy, automatically generated by the compiler, of the subroutines inside the
loop and its own reference space (via the index L) for locating $\mathbf{K}^{ll}$, $\mathbf{M}^{ll}$, and $\mathbf{f}^l$. This yields
concurrent execution for all harmonic terms because distinct processors hold different values of L. It
not only prevents memory access conflicts in performing these tasks but also enables us to use a
single set of subroutines for all harmonic terms. The same applies to the other three subroutines as
well. Note that the index L is also passed to the subroutine GenAss as a local variable because it
is required for evaluating $\mathbf{K}^{ll}_{(e)}$, $\mathbf{M}^{ll}_{(e)}$, and $\mathbf{f}^l_{(e)}$, whose arrays should be declared inside GenAss
so that they become local variables.
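
The ownership pattern described above, one task per harmonic term with every read and write confined to slice L of the shared arrays, can be sketched in Python with the standard `concurrent.futures` module. This is only an analogy for the Alliant concurrent loop: the arrays are nested lists indexed `[l][i]`, the per-harmonic work is a made-up diagonal system standing in for GenAss, Form, Factorize, and the solve, and CPython threads are used to show the access pattern rather than to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def process_harmonic(l, SK, SF, SU):
    """Do all work for harmonic l; only row l of each shared array is touched."""
    n = len(SF[l])
    for i in range(n):
        SK[l][i] = float((l + 1) ** 2 + i + 1)   # placeholder diagonal "stiffness"
        SF[l][i] = 1.0                           # placeholder load
    for i in range(n):
        SU[l][i] = SF[l][i] / SK[l][i]           # placeholder "solve"
    return l


m, n = 8, 51          # 8 harmonic terms, 51 equations per harmonic
SK = [[0.0] * n for _ in range(m)]
SF = [[0.0] * n for _ in range(m)]
SU = [[0.0] * n for _ in range(m)]

# One task per harmonic term; no two tasks share a row, so no locking is needed.
with ThreadPoolExecutor(max_workers=m) as pool:
    done = list(pool.map(lambda l: process_harmonic(l, SK, SF, SU), range(m)))
```

Because the harmonic index selects a disjoint slice of every array, the same routine can be reused for all harmonic terms without any synchronization, which mirrors the role of the index L in the Fortran loop.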
Vectorization. To address vectorization, we now turn to the computations for a single
harmonic term. First we note that the formation of the effective stiffness matrix $\hat{\mathbf{K}}^{ll}$ and effective
load vector $\hat{\mathbf{f}}^l$, and the calculation of $\ddot{\mathbf{u}}^l$ and $\dot{\mathbf{u}}^l$ consist mainly of matrix-matrix (vector-vector)
additions and matrix-vector multiplications and are thus highly vectorizable. The vectorization
and parallelization of factorizing and solving the linear system $\hat{\mathbf{K}}^{ll} \mathbf{u}^l = \hat{\mathbf{f}}^l$ have been studied
intensively; see [13, 15, 23] for example. In this paper, we shall only focus on approaches to
vectorizing the generation of $\mathbf{K}^{ll}_{(e)}$ and the assemblage of $\mathbf{K}^{ll}$. The generation of $\mathbf{M}^{ll}_{(e)}$ ($\mathbf{f}^l_{(e)}$) and
the assemblage of $\mathbf{M}^{ll}$ ($\mathbf{f}^l$) follow the same way and, thus, need not be discussed.

There are two approaches to vectorizing the generation of $\mathbf{K}^{ll}_{(e)}$. The first, referred to as
Vectorization within a Single Strip (VSS), is to generate the entries of $\mathbf{K}^{ll}_{(e)}$ in vector mode. This
approach requires minimal storage because $\mathbf{K}^{ll}_{(e)}$ for all strips can share the storage of
a single strip stiffness matrix, which is usually the case in traditional finite strip or finite
element programs. The disadvantage is that the vector length available for vectorization is limited
by the order of the strip stiffness matrix, 6 in our case, which is rather small. In addition, the
generation step may not even involve any loop structure because most of the Fortran statements
may simply be assignment statements when the entries of $\mathbf{K}^{ll}_{(e)}$ are explicitly integrated. Therefore,
we resort to the second approach: Vectorization across Multiple Strips (VMS). This approach
generates the matrix entries component-wise across many different strips by exploiting the fact
that each strip matrix can be generated independently of the others. It requires, however, a
manual change in the data structure of the strip matrix in the computer program because current
code restructurers can hardly accomplish this task automatically. One way of achieving our goal
is to add one more dimension (preferably the first dimension) to the array that stores a strip
matrix so that the new array can store all strip stiffness matrices. For example, let EKL(1:6,1:6)
be the array used in the VSS approach for storing a single strip stiffness matrix, shared
by all strips, one at a time. (For simplicity, we ignore the symmetry of the matrix.) When the
VMS approach is employed, we can simply change EKL to a 3D array, say EKL(1:ns,1:6,1:6), so
that the first dimension is associated with strip identifications, allowing vector execution to be
performed across strips. Although the change in data structure may impose some programming
difficulty in modifying an existing code, this approach provides a very good way to achieve both
vectorization and parallelization.
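
To make the VMS idea concrete, the sketch below generates strip mass matrices component-wise across all strips, using the strip index as the innermost (contiguous) index in the same spirit as EKL(1:ns,1:6,1:6). The closed-form entries are derived here, not quoted from the paper, by integrating (8) under the stated assumptions of linear shape functions, constant thickness, the consistent mass approach, and $l \ge 1$: $\int N_i^2\,dx = b/3$, $\int N_i N_j\,dx = b/6$ for a strip of width $b$, and $\int_0^{L_y} S_l^2\,dy = \int_0^{L_y} C_l^2\,dy = L_y/2$. The array name `EML` and the function name are hypothetical.

```python
def strip_mass_vms(widths, Ly, rho, h):
    """Return EML[a][b][s]: entry (a, b) of the 6x6 strip mass matrix of strip s."""
    ns = len(widths)
    lam = [rho * h, rho * h**3 / 12.0, rho * h**3 / 12.0]   # diagonal of A
    EML = [[[0.0] * ns for _ in range(6)] for _ in range(6)]
    for a in range(6):
        for b in range(6):
            if a % 3 != b % 3:
                continue                      # different displacement types do not couple
            node_factor = 1.0 / 3.0 if a // 3 == b // 3 else 1.0 / 6.0
            coeff = lam[a % 3] * 0.5 * Ly * node_factor
            # One sweep over all strips: this is the long "vector" operation.
            EML[a][b] = [coeff * width for width in widths]
    return EML


# 16 equal strips across a 60-inch-wide, 40-inch-long, 1-inch-thick plate.
widths = [60.0 / 16.0] * 16
EML = strip_mass_vms(widths, Ly=40.0, rho=0.00073, h=1.0)
```

The two outer loops run over the at most 36 components, while the inner sweep has length ns, so the vector length grows with the number of strips instead of being capped at 6.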

So far as the assemblage of the $l$th system stiffness matrix $\mathbf{K}^{ll}$ is concerned, both VSS and
VMS are still applicable if potential data dependencies are avoided. Note that assembling an
entry of $\mathbf{K}^{ll}_{(e)}$ into $\mathbf{K}^{ll}$ has no conflict with assembling the other entries of the same matrix into $\mathbf{K}^{ll}$.
Vectorization obviously can be performed within any single strip matrix without any difficulty,
subject to the same disadvantage of short vector length as in the generation step. The
following Fortran code indicates where vectorization can be performed using VSS for assembling
the stiffness matrix, where the rows of SKL store the upper diagonals of the band symmetric
matrix $\mathbf{K}^{ll}$ using the Linpack format [12], with the main diagonal of $\mathbf{K}^{ll}$ stored in the last row of
SKL.
      DO 100 I = 1, NBK                       ! NBK (=6): Semi-bandwidth of K^{ll}
         SKL(I, 1:N) = 0.0       (vector)     ! Initialization. N: No. of equations of K^{ll}
      END 100
      DO 300 K = 1, NS                        ! NS: No. of strips
         K1 = 3 * (K-1)
         DO 200 J = 1, 6
            J1 = K1 + J
            I1 = NBK - J + 1
            SKL(I1:NBK, J1) = SKL(I1:NBK, J1) + EKL(1:J, J)     (vector)
                                              ! Vector length too short.
         END 200
      END 300
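
For readers who want to check the band-storage indexing, the following Python sketch mirrors the VSS listing with 0-based indices and expands the result back into a full matrix for comparison. It passes a list of strip matrices rather than reusing one shared array, and the strip matrix values are made up; both are simplifications for illustration.

```python
NBK = 6  # semi-bandwidth, counting the main diagonal


def assemble_vss(strip_matrices):
    """Assemble 6x6 strip matrices into upper band storage, one strip at a time.

    Row NBK-1 of the result holds the main diagonal; strip k couples
    nodal lines k and k+1 (three unknowns per nodal line).
    """
    ns = len(strip_matrices)
    n = 3 * (ns + 1)
    skl = [[0.0] * n for _ in range(NBK)]
    for k, ekl in enumerate(strip_matrices):
        k1 = 3 * k
        for j in range(6):
            j1 = k1 + j
            i1 = NBK - 1 - j
            # Inner loop = the short "vector" of length j+1 (at most 6).
            for i in range(j + 1):
                skl[i1 + i][j1] += ekl[i][j]
    return skl


def band_to_dense(skl):
    """Expand the band storage into a full symmetric matrix (for checking only)."""
    n = len(skl[0])
    dense = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for row in range(max(0, col - NBK + 1), col + 1):
            value = skl[NBK - 1 - (col - row)][col]
            dense[row][col] = value
            dense[col][row] = value
    return dense


# Three identical strips with a made-up symmetric strip matrix.
ekl = [[float((i + 1) * (j + 1)) for j in range(6)] for i in range(6)]
dense = band_to_dense(assemble_vss([ekl, ekl, ekl]))
```

The inner loop never exceeds six iterations, which is the short vector length the text identifies as the weakness of VSS.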


Care, however, must be taken when the VMS approach is employed for assembling $\mathbf{K}^{ll}$. This is
because different strips may have some nodes in common, which amounts to saying that the entries
of $\mathbf{K}^{ll}_{(e)}$ from different strips may contribute to the same location in $\mathbf{K}^{ll}$. Therefore, in
order to vectorize the assemblage of $\mathbf{K}^{ll}$ from $\mathbf{K}^{ll}_{(e)}$ across multiple strip elements, we must find
a way to avoid potential simultaneous updates of a common matrix entry. A general approach
to avoid this situation is to use graph coloring techniques to partition the strips so that no two
strips in the same group share a node. For the plate problems under consideration,
two colors are enough: one for odd strips and the other for even strips. When a natural ordering
is imposed as shown in Figure 2, however, a better approach to enhancing vectorization can
be employed by assembling entries component-wise (or node-wise) across all strip elements as
shown below, assuming the $i$th strip runs from nodal line $i$ to nodal line $i + 1$ and all strip stiffness
matrices are available.

      DO 100 I = 1, NBK                       ! NBK (=6): Semi-bandwidth of K^{ll}
         SKL(I, 1:N) = 0.0       (vector)     ! N: No. of equations of K^{ll}
      END 100
      DO 300 J = 1, 6
         JS = 3 * (NS-1) + J                  ! NS: No. of strips
         DO 200 I = 1, J
            IJ = NBK - J + I
            SKL(IJ, J:JS:3) = SKL(IJ, J:JS:3) + EKL(1:NS, I, J)     (vector)
         END 200
      END 300


Note that the array EKL now has one dimension more than the one used in the previous code.
The storage can be reduced by about half if the symmetry of the matrix is taken into account. Finally,
we would like to mention that for a cluster-based multiprocessor with parallelism of three levels
like the Cedar [11], FSM is a perfect candidate because the decoupling at the system level offers
a great deal of freedom for the problem to be solved using all levels of parallelism. For example,
we need exploit only the first two levels of parallelism in a linear system solver instead of three
because the highest level of parallelism can be employed across multiple linear subsystems.

Figure 4: The triangular loading (uniformly distributed on the entire plate).


NUMERICAL EXPERIMENTS 

To demonstrate the effectiveness and parallelizability of FSM, we consider the dynamic Mindlin
analysis of a thin steel plate that is simply supported on all four of its edges and is subject to a
uniformly distributed triangular loading $q(t)$ as shown in Figure 4. This plate, adapted from [2],
is 60 inches ($L_x$) wide, 40 inches ($L_y$) long, and one inch thick throughout. The
material of the plate is assumed to be isotropic with Young's modulus $E = 30 \times 10^6$ ksi, Poisson's
ratio $\nu = 0.25$, and a mass density of $\rho = 0.00073$ lb-sec$^2$/in$^4$. The time step size $\Delta t$ is set
to 0.00001 sec. In evaluating the strip stiffness matrices, reduced integration with one Gaussian
point is used to overcome the shear locking behavior [18]. The strip mass matrices are evaluated
using the consistent mass approach. The linear algebraic differential equations are solved using
the Newmark integration method with parameters $\alpha = 0.25$ and $\delta = 0.50$ [3, p. 311]. A banded
direct solver is used to solve the resulting linear subsystems in each time step.

In Figure 5, we compare the numerical solution of the displacement $w$ at the center of the plate
using 16 Mindlin strip elements with the exact solution (Fourier series) derived from the Kirchhoff
thin plate theory. Eight harmonic terms are used in the finite strip approximation. From Figure
5, it is clear that the finite strip solution is in good agreement with the exact solution of the
Kirchhoff theory. The performance of this method on an Alliant FX/80 is shown in Tables 2 and
3. In Table 2, we compare the CPU time (all in seconds) consumed in the entire analysis, including
the generation, assemblage, and solution of the linear algebraic differential equations and finally
the calculation of the displacements. Three different execution modes are considered: scalar (S),
vector (V), and vector-concurrent (VC). The compiler options [1] used for these modes are shown
in Table 1.

Figure 5: Displacement at the center of the plate. (Vertical axis: displacement in inches; horizontal axis: simulation time t in seconds.)


Table 1: Compiler options

Execution mode            Compiler options    Subprograms compiled
Scalar (S)                -Og -AS -pg         the entire program
Vector (V)                -Ogv -AS -pg        the entire program
Vector-Concurrent (VC)    -Ogv -AS            recursively-called subroutines
                          -Ogvc -AS           others


Table 2: CPU time (in seconds) and vector speedup on the Alliant FX/80 using one processor.

Step                                              Scalar (S)   Vector (V)   S/V    Remark
Solve $LDL^T u = \hat{f}$                           177.1        137.1        1.29   semi-bandwidth too small
Compute $\hat{f}$, $\dot{u}$, $\ddot{u}$ (Newmark)  91.0         25.3         3.60   mainly DAXPY operations
Generate and assemble f                           42.7         12.4         3.45   using the VMS approach
Initialization and I/O                            1.72         1.70         1.01   no manual optimization
Total                                             312.4        176.4        1.77


Table 3: Parallel performance under the vector-concurrent mode.

No. of processors k           1        2        4        8
CPU time in seconds           165.7    84.14    45.01    25.08
Concurrency speedup S_k       1.00     1.97     3.68     6.61
Efficiency E_k (%)            100.0    98.5     92.0     82.6


Figure 6: Concurrency speedup on the Alliant FX/80.


Table 2 shows the vector speedup (the ratio of the 1-processor CPU time spent under the
S mode to that under the V mode) for the entire process. As seen from this table, the vector
speedups for the three most time-consuming parts: (1) solving $\hat{\mathbf{K}}\mathbf{u} = \hat{\mathbf{f}}$, (2) computing $\hat{\mathbf{f}}$, $\dot{\mathbf{u}}$, and
$\ddot{\mathbf{u}}$, and (3) generating $\mathbf{f}^l_{(e)}$ and assembling $\mathbf{f}^l$ are 1.29, 3.60, and 3.45, respectively. Note that
the semi-bandwidth of the system stiffness matrix is only 6 in this example, which is
too short for a banded direct linear system solver to take advantage of vector instructions
in solving the linear system. The vector speedups for the other two parts, however, are very
satisfactory. It is worth mentioning that in generating $\mathbf{f}^l_{(e)}$ and assembling $\mathbf{f}^l$, we employed the
VMS approach, which yields much better vector performance than the VSS approach. Table 3
shows the concurrency speedup $S_k$, defined as the ratio of the CPU time spent under the VC
execution mode of the entire program using only one processor to that using $k$ processors, and
the efficiency $E_k$ ($= S_k/k$), the ratio of the concurrency speedup $S_k$ to the number of processors
$k$. Figure 6 plots the speedup against the number of processors used. As seen from Table 3, the
concurrency speedups observed using 2, 4, and 8 processors are 1.97, 3.68, and 6.61, respectively.
This performance clearly indicates the parallelizability of FSM on multiprocessors when
the number of harmonic terms used matches the number of processors available.
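
The speedup and efficiency figures follow directly from the CPU times in Table 3 and the definitions $S_k = T_1 / T_k$ and $E_k = S_k / k$. The short Python sketch below reproduces them; the variable and function names are chosen here for illustration.

```python
# CPU times (seconds) under the vector-concurrent mode, copied from Table 3.
cpu_time = {1: 165.7, 2: 84.14, 4: 45.01, 8: 25.08}


def speedup_and_efficiency(times):
    """Return {k: (S_k, E_k in percent)} with S_k = T_1 / T_k and E_k = S_k / k."""
    t1 = times[1]
    return {k: (t1 / t, 100.0 * t1 / (t * k)) for k, t in sorted(times.items())}


table3 = speedup_and_efficiency(cpu_time)
for k, (s_k, e_k) in table3.items():
    print(f"k = {k}: S_k = {s_k:.2f}, E_k = {e_k:.1f}%")
```

Rounded to the precision of Table 3, the computed values agree with the tabulated speedups and efficiencies.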


CONCLUSIONS 

The effectiveness and parallelizability of the finite strip method (FSM) for the dynamic analysis
of a class of Mindlin plates have been addressed and vector/parallel implementations presented.
The performance of this method on the Alliant FX/80 has also been tested using a rectangular
plate that is simply supported on all edges and is subject to a uniformly distributed triangular
loading. From the experiments performed, we have obtained concurrency speedups of 1.97, 3.68,
and 6.61 using 2, 4, and 8 processors, respectively. These speedups are satisfactory and very
encouraging, and they clearly demonstrate the advantage of FSM in a parallel processing environment.
For vectorization, good performance has also been observed for the Newmark integration scheme
and for the generation/assemblage process using the VMS (vectorization across multiple strips)
approach. In summary, we conclude that, although vector performance during the solution stage
may be hindered by the small semi-bandwidth of the subsystems if a direct solver is employed, FSM
is highly parallelizable and, therefore, suitable for computation on multiprocessor or multicluster
computers. This is especially true when the problem requires a large number of harmonic terms
to yield accurate results.